In Part I, we applied NLTK to James Joyce's Ulysses and found some interesting features of Chapter 8, Lestrygonians. We started by analyzing characters and letter frequencies, and then moved on to words. In this notebook, we'll be looking at phrases.
In particular, we'll try to improve the part-of-speech tagger by looking at the text at the phrase level, and we'll also apply some chunking algorithms to the text to chunk words into phrases based on their parts of speech.
Let's start by importing our libraries.
In [2]:
# In case we want to plot something:
%matplotlib inline
from __future__ import division
import nltk, re
import numpy as np
# The io module makes unicode easier to deal with
import io
def p():
    print "-"*20
In [3]:
file_contents = io.open('txt/08lestrygonians.txt','r').read()
print type(file_contents)
In [4]:
# Tokenize the chapter using the Punkt Tokenizer:
sentences = nltk.sent_tokenize(file_contents)
print len(sentences)
In [5]:
print sentences[:21]
Now that we've tokenized the text by sentence, we can set to work. The first useful task we'll want to be able to do is to print out a sentence if it contains a given word. We could use a Text object and its concordance('word') function, but that only prints a narrow window of context to the screen - it does not return the sentence or its surrounding context for further use.
Suppose we want to search for a word, like "eye", and we want to return the sentence that contains it, along with two sentences of context (the sentence before, and the sentence after).
We can do this by looping through each sentence, breaking it apart using a word tokenizer, and searching for the word of interest. If we find it, we add the prior sentence, current sentence, and next sentence to the list of instances.
In [6]:
small_sentences = sentences[:21]

def word_with_context(word,sentences):
    final_list = []
    for i,sentence in enumerate(sentences):
        if i>0 and i<(len(sentences)-1):
            words = nltk.word_tokenize(sentence)
            if word in words:
                final_list.append( [re.sub('\n',' ',sentences[i-1]),
                                    re.sub('\n',' ',sentences[i]),
                                    re.sub('\n',' ',sentences[i+1])] )
    return final_list

for i in word_with_context('eyes',sentences):
    p()
    print '\n'.join(i)
This is a useful function that we can combine with other conditions, such as searching a wordlist for words matching a certain pattern: we pass a pattern and get back each word matching it, with three sentences of context. We'll need a wordlist first, which we can obtain by tokenizing the chapter text.
In [7]:
wordlist = nltk.word_tokenize(file_contents)
wordlist = [w.lower() for w in wordlist]
english_words = [w for w in nltk.corpus.words.words('en') if w.islower()]
z1 = set(wordlist)
z2 = set(english_words)
In [8]:
print "Number of words in Chapter 8:",len(wordlist)
print "Number of unique words in Ch. 8:",len(z1)
print "Number of words in English dictionary:",len(z2)
print "Numer of words in Ch. 8 in English dictionary:",len( z1.intersection(z2) )
In [9]:
intersection = z1.intersection(z2)
non_dictionary_words = z1.symmetric_difference(intersection)
print len(non_dictionary_words)
In [10]:
non_dictionary_words = sorted(list(non_dictionary_words))
print non_dictionary_words[110:125]
We now have a list of words that aren't found in an English dictionary provided by the NLTK corpus, so these have the potential to be interesting words. We'll use these results to print out some context for each word.
While we're at it, we can also get word counts of each of these words using a Text object:
In [11]:
text = nltk.Text(wordlist)
print "Number of occurences of",non_dictionary_words[115],":",text.count(non_dictionary_words[113])
In [12]:
result = word_with_context(non_dictionary_words[115],sentences)
print '\n'.join(result[0])
The phrase "woman's breasts full" is reminiscent of Lady Macbeth's speech from Macbeth, Act 1 Scene 5, when she discovers Duncan is staying the night (it has a somewhat, uh, different tone):
Stop up the access and passage to remorse,
That no compunctious visitings of nature
Shake my fell purpose, nor keep peace between
The effect and it! Come to my woman’s breasts,
And take my milk for gall, you murd'ring ministers,
Wherever in your sightless substances
You wait on nature’s mischief.
- Macbeth, Act 1, Scene 5
If instead we wanted to search for words matching a regular expression, we could write a function that takes a regular expression, searches for words matching that expression, and passes them to the word_with_context()
function.
In [13]:
def re_with_context(rex,sentences):
    final_list = []
    for i,sentence in enumerate(sentences):
        if i>0 and i<(len(sentences)-1):
            words = nltk.word_tokenize(sentence)
            for word in words:
                if len(re.findall(rex,word))>0:
                    final_list.append( [re.sub('\n',' ',sentences[i-1]),
                                        re.sub('\n',' ',sentences[i]),
                                        re.sub('\n',' ',sentences[i+1])] )
    return final_list

for i,ss in enumerate(re_with_context(r'ood\b',sentences)):
    if i<25:
        p()
        print '\n'.join(ss)
Now we are able to pass words and regular expressions, and get a few sentences of context back in return. We can use various techniques to identify keywords, or provide keywords from a file or from a list and iterate through them.
We can also look for particular phonetic sounds, which often occur in groups (as we can see from the word searches above, many of the sentences are repeated because the "ood" pattern often shows up several times within the space of a few sentences).
We can also look for patterns across the chapters - something we haven't done yet, since we've been focusing on Chapter 8 alone as a smaller, more manageable body of text.
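As a rough illustration of how those phonetic clusters might be located, here is a minimal sketch that reuses the sentences list and the re module imported above; the five-sentence window and the threshold of three matches are arbitrary choices for illustration, not part of the original analysis:

# Count matches of a phonetic pattern in each sentence, then flag
# five-sentence windows where the pattern clusters.
pattern = r'ood\b'
counts = [len(re.findall(pattern, s)) for s in sentences]
window = 5
for i in range(len(counts) - window + 1):
    total = sum(counts[i:i+window])
    if total >= 3:
        print "Sentences %d-%d contain %d matches of %s" % (i, i+window-1, total, pattern)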
First, let's expand on that context function, to print out N sentences of context:
In [14]:
def re_with_context(rex,sentences,n_sentences):
    final_list = []
    half = int(np.floor(n_sentences/2))
    for i,sentence in enumerate(sentences):
        if i>=half and i<(len(sentences)-half):
            words = nltk.word_tokenize(sentence)
            for word in words:
                if len(re.findall(rex,word))>0:
                    short_list = []
                    for s in sentences[i-half:i+half+1]:
                        short_list.append( re.sub(r'[\n\t]',' ',s) )
                    final_list.append(short_list)
    return final_list

for group in re_with_context('eyes',sentences,5):
    p()
    print '\n'.join(group)
If we want to start analyzing Ulysses as a whole and look for connections across chapters, we'll need objects to store data about each chapter, objects that will encapsulate much of the functionality laid out in Part I and Part II of these notebooks.
To design such an object - a Lestrygonians object, say - we would first want to define a UlyssesChapter class. The constructor would take a text file representing the chapter, and there would be a number of methods to get useful lists, dictionaries, or sets (a rough sketch of such a class follows the lists below).
Useful lists:
Useful dictionaries:
Useful sets:
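Putting these pieces together, a rough sketch of what such a UlyssesChapter class might look like - the class and method names here are hypothetical, and it simply wraps the NLTK calls already used above rather than adding anything new:

import io, nltk

class UlyssesChapter(object):
    def __init__(self, filename):
        self.text = io.open(filename, 'r').read()
        self.sentences = nltk.sent_tokenize(self.text)
        self.wordlist = [w.lower() for w in nltk.word_tokenize(self.text)]

    def word_set(self):
        """Set of unique (lowercased) tokens in the chapter."""
        return set(self.wordlist)

    def word_counts(self):
        """Dictionary-like frequency distribution of tokens."""
        return nltk.FreqDist(self.wordlist)

    def non_dictionary_words(self):
        """Tokens not found in the NLTK English wordlist."""
        english = set(w for w in nltk.corpus.words.words('en') if w.islower())
        return self.word_set() - english

# lestrygonians = UlyssesChapter('txt/08lestrygonians.txt')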
Building wordlists:
['orange','yellow','green','blue','indigo','rose','violet']
becomes [u'blue', u'greenhouses', u'greens', u'penrose', u'orangepeels', u'bluecoat', u'orangegroves', u'greeny', u'yellow', u'bluey', u'yellowgreen', u'green', u'rose', u'blues', u'greenwich']
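One way that expansion might work - a minimal sketch, treating each seed color as a substring to match against the chapter's unique tokens in the z1 set built above:

# Expand a seed wordlist into every chapter token containing one of the seeds.
seed = ['orange','yellow','green','blue','indigo','rose','violet']
expanded = [w for w in z1 if any(s in w for s in seed)]
print expanded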
While it would be a long and labor-intensive process to build a grammar parser that could understand, even at a primitive level, the text of Ulysses, we can get some useful information from some very dumb tags.
In other words, by tagging various sentences, based on the words contained in them, we can find themes that run throughout the novel - from objects that appear and reappear (soap, eyes, bread) to people (Martin Cunningham, Cashel Boyle O'Connor Fitzmaurice Tisdall Farrell, Molly) and more complex phrases (references to Shakespeare, for instance, that can be detected with word overlap).
These can be organized into a frequency matrix, with a row for each word in the body of text and a column for each keyword, theme, or tag.
You could also create a similar matrix for sentences and, based on the words in each sentence, tag the sentences by subject matter.
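A minimal sketch of such a sentence matrix, using pandas - the keyword columns below are just example themes for illustration, not a curated tag set:

import pandas as pd

# One row per sentence, one column per example theme, counting matches.
keywords = ['eyes','bread','soap']
freq = pd.DataFrame({'sentence': sentences})
for kw in keywords:
    freq[kw] = freq['sentence'].apply(lambda s: len(re.findall(kw, s.lower())))

# Show the sentences tagged with at least one of the example themes:
print freq[ freq[keywords].sum(axis=1) > 0 ].head()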
Tracking the appearance and disappearance of various characters would be marvelous.
Even a simple list of characters would be great.
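A crude first pass at such a character list might look like the following - a sketch only, using NLTK's default part-of-speech tagger and a plain Counter; it will inevitably pick up false positives such as capitalized sentence-initial words:

from collections import Counter

# Collect tokens tagged as proper nouns (NNP) as candidate character names.
candidates = Counter()
for sentence in sentences:
    for token, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        if tag == 'NNP' and token[0].isupper():
            candidates[token] += 1

print candidates.most_common(20)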
In [15]:
words = nltk.word_tokenize(file_contents)
print len(words)
In [16]:
import pandas as pd
In [63]:
# Count occurrences of 'cream' in each of the first 50 tokens:
wdf = pd.DataFrame(columns=['words'],data=words[:50])
wdf['cream'] = wdf.apply( lambda x : len( re.findall('cream',x.values[0]) ), axis=1)
print wdf